Evaluation and Comparison of Open Source Bibliographic Reference Parsers: A Business Use Case

Authors

  • Dominika Tkaczyk
  • Andrew Collins
  • Paraic Sheridan
  • Jöran Beel
Abstract

Bibliographic reference parsing refers to extracting machine-readable metadata, such as the names of the authors, the title, or the journal name, from bibliographic reference strings. Many approaches to this problem have been proposed, including regular expressions, knowledge bases and supervised machine learning. Many open source reference parsers based on various algorithms are also available. In this paper, we apply, evaluate and compare ten reference parsing tools in a specific business use case. The tools are Anystyle-Parser, Biblio, CERMINE, Citation, Citation-Parser, GROBID, ParsCit, PDFSSA4MET, Reference Tagger and Science Parse, and we compare them both in their out-of-the-box versions and after tuning to the project-specific data. According to our evaluation, the best performing out-of-the-box tool is GROBID (F1 0.89), followed by CERMINE (F1 0.83) and ParsCit (F1 0.75). We also found that although machine learning-based tools and tools based on rules or regular expressions achieve similar precision on average (0.77 for ML-based tools vs. 0.76 for non-ML-based tools), ML-based tools achieve recall three times higher than non-ML-based tools (0.66 vs. 0.22). Our study also confirms that tuning the models to the task-specific data improves quality. The retrained versions of the reference parsers are in all cases better than their out-of-the-box counterparts: for GROBID, F1 increased by 3% (0.92 vs. 0.89), for CERMINE by 11% (0.92 vs. 0.83), and for ParsCit by 16% (0.87 vs. 0.75).
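
As a rough check of the figures quoted in the abstract, the sketch below recomputes the relative F1 increases and the recall ratio from the reported values. The function name `relative_increase` and the `f1_scores` dictionary are illustrative assumptions, not something taken from the paper; only the numbers come from the abstract.

```python
def relative_increase(before: float, after: float) -> float:
    """Relative change from `before` to `after`, as a fraction of `before`."""
    return (after - before) / before


# (out-of-the-box F1, retrained F1) as reported in the abstract
f1_scores = {
    "GROBID": (0.89, 0.92),
    "CERMINE": (0.83, 0.92),
    "ParsCit": (0.75, 0.87),
}

for tool, (out_of_the_box, retrained) in f1_scores.items():
    gain = relative_increase(out_of_the_box, retrained)
    print(f"{tool}: F1 {out_of_the_box:.2f} -> {retrained:.2f} ({gain:.0%} relative increase)")

# Average recall of ML-based vs. non-ML-based tools, as reported in the abstract
recall_ratio = 0.66 / 0.22
print(f"Recall ratio (ML-based / non-ML-based): {recall_ratio:.1f}")
```

Under this reading, the percentages in the abstract are relative increases over the out-of-the-box F1, not absolute differences in F1 points.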

Similar resources

Comparison of Open Source Learning Management Softwares and Presenting a Native Evaluation Tool

Introduction: Nowadays all educational institutes are trying to use technology in their structure. This effort has faced different barriers, including cost, time, and support. Therefore, using open source software can partially help us in using technology. In this article, we review the main features of several open source learning management software packages, while presenting a tool which incl...

Development of a conceptual model for asthma management system in primary care

Introduction: Asthma is uncontrolled in more than half of asthma patients due to inadequate and incorrect management. The main reasons for inadequate management are non-adherence, inadequate knowledge of a general practitioner about the patient's clinical condition, and not following asthma management guidelines. The purpose of this study was to develop a conceptual model for the asthma mana...

A Comparison of Chinese Parsers for Stanford Dependencies

Stanford dependencies are widely used in natural language processing as a semantically oriented representation, commonly generated either by (i) converting the output of a constituent parser, or (ii) predicting dependencies directly. Previous comparisons of the two approaches for English suggest that starting from constituents yields higher accuracies. In this paper, we re-evaluate both methods ...

Introducing hard rock TBMs’ downtime analysis model with reference to past case histories’ data

The study of downtime and subsequently machine utilization in a given project is one of the major requirements of an accurate estimation of TBM performance and daily advance rate. Interestingly, while it is very common to report the components of downtime when discussing a tunneling project in the literature; there has not been a great amount of in-depth studies on this topic in the recent year...

A Comparison of the Customer Relationship Management Strategies of Nigerian Banks and Insurance Companies

This study aimed at finding out if banks and insurance companies in Nigeria use CRM as a marketing strategy as well as whether these organizations have employed the same variables to achieve Customer Relationship Management. Relevant literature was reviewed and a model consisting of seventeen variables was conceptualized and tested by means of empirical data collected through a questionnaire su...

Journal:
  • CoRR

Volume: abs/1802.01168

Publication date: 2018